Integrating 3rd party search engines into KM index management


Applies To:

Knowledge Management 6.0

Summary

The Knowledge Management (KM) and Repository Framework (RF) APIs let you implement your own index services for KM’s index management service. The index management service consolidates the search results from the different index service implementations and presents them to the end user in a uniform way, using KM’s flexible user interface. Results can be shuffled, grouped by certain criteria, and rendered by standard KM technology without any additional development effort.

Note: If you use the described coding and configuration to access externally provided search functionality, which your organization does not own, for example, a search service on the Web, you must request a licence from the provider. Failing to do so may be illegal. The conditions for such a licence are usually published on the Web pages of the Web search provider.

Relevant KM and RF APIs

Starting the implementation is quite easy. There is a set of APIs that must be used for implementation. The relevant packages of KM & RF APIs are:

Prerequisites for 3rd-Party Search Engines

To integrate a 3rd-party search engine, the following prerequisites should be fulfilled by the engine:

Architecture of KM’s Index Management

The index management service is one of KM’s global services that is not linked to a repository manager. The index management service provides functionality for managing index-based search & classification engines. In our case, we shall focus on search engine integration. The index management service can manage different types of indexes that are handled in different index collections. An index collection groups indexes of the same type. In our sample implementation, a 3rd-party search engine delivers its own index type. This means that we have to implement our own index collection and our own virtual index.

The search, which is normally triggered by a UI or using an API, is executed using the index management service. Based on the configured indexes, the index management service decides which virtual index is used for searching.

A 3rd-party search engine does not normally offer an interface to its index implementation even if an index exists. In this case, the virtual index of index management has to be implemented as a search trigger for the 3rd-party search engine. This means that instead of calling the index itself, a search is executed using the API of the 3rd-party search engine.
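
The dispatching described above can be pictured with a small, self-contained sketch. All types in it (VirtualIndexSketch, ExternalEngineSketch, EngineBackedIndexSketch, IndexDispatcherSketch) are hypothetical stand-ins and not part of the KM API; they only illustrate how a virtual index can act as a search trigger that forwards the query to the API of an engine instead of reading an index of its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a virtual index; not a KM interface. */
interface VirtualIndexSketch {
    List<String> search(String query);
}

/** Hypothetical stand-in for the API of an external search engine. */
interface ExternalEngineSketch {
    List<String> query(String query);
}

/** A virtual index that holds no index data and only forwards the query. */
final class EngineBackedIndexSketch implements VirtualIndexSketch {
    private final ExternalEngineSketch engine;

    EngineBackedIndexSketch(ExternalEngineSketch engine) {
        this.engine = engine;
    }

    public List<String> search(String query) {
        // Instead of calling an index, the search is delegated to the engine.
        return engine.query(query);
    }
}

/** Hypothetical stand-in for the dispatching role of index management. */
final class IndexDispatcherSketch {
    private final Map<String, VirtualIndexSketch> configured = new LinkedHashMap<>();

    void register(String indexId, VirtualIndexSketch index) {
        configured.put(indexId, index);
    }

    /** Runs the query on every configured index and concatenates the hits. */
    List<String> search(String query) {
        List<String> merged = new ArrayList<>();
        for (VirtualIndexSketch index : configured.values()) {
            merged.addAll(index.search(query));
        }
        return merged;
    }
}
```

In the real service, the hits would be KM result objects rather than plain strings, and the selection of indexes would follow the index configuration.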

The result of such a search has to be transformed into result objects that comply with KM index management. For this reason, the 3rd-party search engine has to provide certain information such as an access path, meta information, and ranking information. KM-compliant results must be accessible using a URI or CM-conformant RIDs (resource IDs) and need to be filled with KM-specific information such as a document ID, a title, or an author.

All these processes are described in detail in the following sections.

Coding in Detail

The coding part is divided into three implementation objects.

Implementing a Virtual Index

A virtual index implementation should extend the AbstractIndex (package: com.sapportals.wcm.service.indexmanagement) class and implement the ISearchIndex (package: com.sapportals.wcm.service.indexmanagement.retrieval.search) interface. The AbstractIndex class implements basic index managing functionality such as the creation and deletion of indexes. These functions can be used for managing the physical index of the 3rd-party search engine. If a 3rd-party search engine does not support indexes, the implementation is still necessary but could be left as an empty (dummy) index implementation.

public class CustomerSearchIndex extends AbstractIndex implements ISearchIndex ...

The generated index must be registered at the corresponding index collection implementation. This can be carried out when the constructor is called.

public CustomerSearchIndex(String indexId, String indexName, String indexGroup, IIndexFolderList indexFolders, String serviceId, String profileId, String serviceUserId, Properties indexServiceProperties) 
throws WcmException {
    super(...);          
    Collection indexes = new ArrayList();           
    indexes.add((IIndex) this);           
    m_searchIndexCollection = new CustomerSearchIndexCollection(indexes);   
}  

The index implementation must return a valid set of supported options. For search indexes, the options SupportedOption.SEARCH and SupportedOption.SEARCH_RESULTS_SORTED_BY_PROPERTY should be returned.

private final static ISupportedOptionSet SUPPORTED_OPTIONS = new SupportedOptionSet();  
static {     
    SUPPORTED_OPTIONS.add(SupportedOption.SEARCH);     
    SUPPORTED_OPTIONS.add(SupportedOption.SEARCH_RESULTS_SORTED_BY_PROPERTY);  
}  
:  
public ISupportedOptionSet getSupportedOptions() {  
    return SUPPORTED_OPTIONS;  
}

The service type defines the usage of the index within the index management service. A search index is used for this implementation, but it is also possible to have a search and classification type.

public List getServiceTypes() {
    List serviceTypes = new ArrayList();
    serviceTypes.add(IWcmIndexConst.SERVICE_TYPE_SEARCH);
    return serviceTypes;  
}  

The virtual index generation is carried out by calling the generate() method of the index implementation. By default, the internalSaveIndex() method of the IIndexService interface has to be called.

public void generate(IResourceContext ctxt) throws WcmException {
    IIndexService indexService = (IIndexService)
        ResourceFactory.getInstance().getServiceFactory().
            getService(IServiceTypesConst.INDEX_SERVICE);

    try {
        indexService.internalSaveIndex((IIndex) this);
    } catch (ResourceException e) {
        throw new WcmException(e);
    }
}

Methods to be implemented in the virtual index implementation:

Method | Purpose
getServiceTypes() | Gets the service types that this index supports. For search indexes, at least IWcmIndexConst.SERVICE_TYPE_SEARCH should be included.
generate(IResourceContext ctxt) | Creates the virtual index in the index management object.
delete(IResourceContext ctxt) | Deletes the virtual index in the index management object.
getSupportedOptions() | Gets the options of this index. For search indexes, at least SupportedOption.SEARCH must be supported.

All other methods that are not mentioned here can be left with a default implementation or should throw a NotSupportedException(...). Example:

public ISearchSession searchSimilarDocumentsWithSession(IResourceList res)
throws WcmException {
    throw new WcmException(
        new NotSupportedException(
            "Method searchSimilarDocumentsWithSession() not "
            + "supported on CustomerSearchIndex implementation.")
    );
}

Implementing a Virtual Index Collection

The ISearchIndexCollection implementation is the execution container of virtual indexes of the same type. Therefore, every 3rd-party search engine needs its own virtual index and collection implementation.

The collection implementation is the central access point for the index management service for triggering the search and fetching the search results from the 3rd-party engine.

An implementation can directly implement the ISearchIndexCollection interface or extend the abstract class AbstractSearchIndexCollection. The abstract class reduces the implementation effort because it provides default implementations for many methods:

public class CustomerSearchIndexCollection extends AbstractSearchIndexCollection ...

The most important methods to implement are executeQueryWithSession(...) and getSearchResults(...). Both methods are overloaded. You do not have to implement every signature, because the shorter signatures redirect to the corresponding method that takes the most parameters.

public ISearchSession executeQueryWithSession(IQueryEntryList queryEntryList,
   IResourceContext context, int initNumberMaxRawResults, ICollection searchFromHere, 
   ISortPropertyName sortProperty, Set set)
throws WcmException 

The executeQueryWithSession(...) method encapsulates the search result in a stateful session object. This session object can be created with the getNewSearchSession() method of the AbstractSearchIndexCollection class:

ISearchSession session =
   this.getNewSearchSession(queryEntryList,
         searchFromHere, this, context,
         initNumberMaxRawResults, sortProperty
   );
IRawSearchSession rawSession = (IRawSearchSession) session;  

The query parameters specified on the UI are passed to the executeQueryWithSession() method and must be transformed into 3rd-party search engine compliant queries:

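// Start with an empty string: appending to a null reference would
// put the literal text "null" at the beginning of the query.
String queryString = "";
IQueryEntryListIterator queriesIterator = queryEntryList.listIterator();
IQueryEntry query = null;
while (queriesIterator.hasNext()) {
    query = queriesIterator.next();
    // Only plain search terms are passed on to the 3rd-party engine.
    if (query.getRowType().equals(IQueryEntry.ROW_TYPE_TERM)) {
        queryString += "+" + query.getValueAsString();
    }
}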
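
The translation step can also be tried outside the portal. The following is a minimal sketch under stated assumptions: QueryEntrySketch is a hypothetical stand-in for IQueryEntry, and the two row type values are invented for illustration. It keeps only term entries and joins them with "+", without the leading "+" that the snippet above produces; whether a leading "+" is wanted depends on the query syntax of the engine.

```java
import java.util.List;

/** Hypothetical stand-in for a query entry; the real IQueryEntry is not used here. */
final class QueryEntrySketch {
    static final String ROW_TYPE_TERM = "term";         // assumed value
    static final String ROW_TYPE_OPERATOR = "operator"; // assumed value

    final String rowType;
    final String value;

    QueryEntrySketch(String rowType, String value) {
        this.rowType = rowType;
        this.value = value;
    }
}

final class QueryTranslatorSketch {
    /** Joins all term entries with '+' and skips every other row type. */
    static String toEngineQuery(List<QueryEntrySketch> entries) {
        StringBuilder query = new StringBuilder();
        for (QueryEntrySketch entry : entries) {
            if (!QueryEntrySketch.ROW_TYPE_TERM.equals(entry.rowType)) {
                continue;
            }
            if (query.length() > 0) {
                query.append('+');
            }
            query.append(entry.value);
        }
        return query.toString();
    }
}
```

Building the query in a StringBuilder avoids the pitfall of concatenating onto an uninitialized string, and an empty entry list simply yields an empty query.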

The query must then be passed to the 3rd-party search engine implementation, which executes it. The raw search results have to be placed into the search session object. At this point, the raw search results still depend on the 3rd-party search engine and need to be transformed into KM-compliant results in the next step.

FooSearch search = new FooSearch();   
FooResult[] resultElements = search.doSearch(queryString);           
List rawResultList =
     this.getRawSearchResultList(resultElements, rawSession, initNumberMaxRawResults);   
rawSession.setRawSearchResults(rawResultList);  

The raw search results are passed by the index management service to the implemented getSearchResults(...) method. The transformation of the raw results into KM-compliant results must be carried out within this method. This means that all results of the 3rd-party search engine have to be transformed into IResource objects.

public ISearchResultList getSearchResults(List rawSearchResults, ResourceContext context) 
throws ResourceException   

To generate valid KM IResource objects, the result URL of the raw result has to be transformed into a KM RID object.

IURLGeneratorService urlGenerator = (IURLGeneratorService)
    ResourceFactory.getInstance().getServiceFactory().
        getService(IServiceTypesConst.URLGENERATOR_SERVICE);

IUri resultUri = UriFactory.parseUri(rawResult.getDocKey());
RID rid = urlGenerator.mapUri(resultUri);

For every RID object, a corresponding IResource object is created at runtime. The IResource object has to be encapsulated in an ISearchResult object, which is the KM-compliant result format.

ISearchResult searchResult = this.createSearchResultObject(resource, 1.0F);
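
The overall shape of this conversion can be sketched without the KM classes. In the sketch below, RawHitSketch and ConvertedHitSketch are hypothetical stand-ins for the raw result and for ISearchResult, and java.net.URI takes the place of the RID mapping. Skipping hits without a usable absolute URL is an assumption about sensible behavior, not something the KM API prescribes. Every converted hit receives the fixed rank 1.0, as in the line above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical raw hit as delivered by an external engine. */
final class RawHitSketch {
    final String url;
    final String title;

    RawHitSketch(String url, String title) {
        this.url = url;
        this.title = title;
    }
}

/** Hypothetical converted hit; stands in for an ISearchResult. */
final class ConvertedHitSketch {
    final URI location;
    final String title;
    final float rank;

    ConvertedHitSketch(URI location, String title, float rank) {
        this.location = location;
        this.title = title;
        this.rank = rank;
    }
}

final class ResultConverterSketch {
    /** Converts raw hits and drops those without an absolute, parseable URL. */
    static List<ConvertedHitSketch> convert(List<RawHitSketch> rawHits) {
        List<ConvertedHitSketch> converted = new ArrayList<>();
        for (RawHitSketch hit : rawHits) {
            if (hit.url == null) {
                continue;
            }
            try {
                URI location = new URI(hit.url.trim());
                if (location.isAbsolute()) {
                    converted.add(new ConvertedHitSketch(location, hit.title, 1.0f));
                }
            } catch (URISyntaxException e) {
                // A hit that cannot be addressed cannot be displayed; skip it.
            }
        }
        return converted;
    }
}
```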

Methods to be implemented in the virtual index collection implementation are listed in the table below. Methods that are not listed here may be implemented or should throw a corresponding exception:

Method | Purpose
executeQueryWithSession() | Executes the query against a 3rd-party index or the interfaces of the search engine and stores the raw search results in a stateful session object.
executeQuery() | Similar to executeQueryWithSession(), but the results are not held in a session object. The search is stateless.
getSearchResults() | Retrieves the raw search result data and places it into KM-compliant result objects (ISearchResult). The objects, which are accessible using a URI, must be provided with a valid rank value.
searchSimilarDocumentsWithSession() | Returns result objects similar to the provided information data. This method may be implemented or should throw a corresponding NotSupportedException object.

Sequence of a Single Search

The sequence diagram below shows a single search request to the 3rd-party search engine implementation.

Implementing a Wrapper for 3rd-Party Search Engines

3rd-party search engines that do not offer a Java API for retrieving search results can still be integrated into KM index management, provided that they offer an HTTP interface that returns XML. In this case, you have to implement a Java HTTP bridge. The implementation of this bridge depends on the result format of the 3rd-party search engine.

The result of an HTTP response has to be parsed and transformed into KM-compliant result objects.

Take a look at the following example implementation of FooSearch:

public FooResult[] doSearch(String query) throws Exception {
    DOMParser parser = new DOMParser();
    parser.parse("http://www.foosearch.com/qxml.php?q="+query);
    this.m_resultDom = parser.getDocument();

    return this.createFooResults();
}
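
Because the query ends up inside a URL, search terms that contain characters such as spaces or "&" need to be encoded first. The following is a minimal sketch that assumes the engine reads the query from a parameter named q, as in the example URL above; the class FooRequestUrlSketch and its build() method are hypothetical helpers. Each term is encoded on its own, and the terms are then joined with "+", which a server decodes as a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

final class FooRequestUrlSketch {
    /** Builds endpoint?q=term1+term2 with every term URL-encoded. */
    static String build(String endpoint, List<String> terms) {
        StringBuilder url = new StringBuilder(endpoint).append("?q=");
        try {
            for (int i = 0; i < terms.size(); i++) {
                if (i > 0) {
                    url.append('+');
                }
                url.append(URLEncoder.encode(terms.get(i), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }
}
```

Encoding the complete, already joined query string instead would turn the "+" separators into "%2B", so the encoding is applied per term.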

The 3rd-party search engine has its own schema for search results. They have to be transferred to KM’s index management. 3rd-party search engine metadata information has to be mapped to KM properties:

private FooResult[] createFooResults() {
    // Every RESULT element of the response becomes one FooResult.
    NodeList n = this.m_resultDom.getElementsByTagName("RESULT");
    FooResult[] result = new FooResult[n.getLength()];
    for (int i = 0; i < n.getLength(); i++) {
        result[i] = new FooResult();
        Node node = n.item(i);
        // Guard against a RESULT element without an ID attribute.
        if (node.hasAttributes()) {
            Node id = node.getAttributes().getNamedItem("ID");
            if (id != null) {
                result[i].setResultID(id.getNodeValue());
            }
        }
        NodeList children = node.getChildNodes();
        for (int j = 0; j < children.getLength(); j++) {
            Node child = children.item(j);
            // Empty elements such as <title/> have no text node.
            Node text = child.getFirstChild();
            if (text == null) {
                continue;
            }
            if (child.getNodeName().equalsIgnoreCase("title")) {
                result[i].setTitle(text.getNodeValue());
            } else if (child.getNodeName().equalsIgnoreCase("url")) {
                result[i].setUrl(text.getNodeValue());
            }
        }
    }
    return result;
}
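
The mapping can be tested in isolation with the XML parser that ships with the JDK. The sketch below makes several assumptions: the response format (a RESULTS root element with RESULT elements that carry an ID attribute and uppercase TITLE and URL children) is invented for illustration, and FooResponseParserSketch is a hypothetical class that returns plain string triples instead of FooResult objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FooResponseParserSketch {
    /** Returns one {id, title, url} triple per RESULT element. */
    static List<String[]> parse(String xml) {
        Document dom;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            dom = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
        NodeList results = dom.getElementsByTagName("RESULT");
        List<String[]> hits = new ArrayList<>();
        for (int i = 0; i < results.getLength(); i++) {
            Element result = (Element) results.item(i);
            hits.add(new String[] {
                result.getAttribute("ID"), // empty string if the attribute is missing
                textOf(result, "TITLE"),
                textOf(result, "URL")
            });
        }
        return hits;
    }

    /** Text of the first matching descendant element, or "" if there is none. */
    private static String textOf(Element parent, String tagName) {
        NodeList matches = parent.getElementsByTagName(tagName);
        return matches.getLength() == 0
            ? ""
            : matches.item(0).getTextContent().trim();
    }
}
```

Missing attributes and missing or empty elements yield empty strings here, so an incomplete hit does not abort the processing of the whole response.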

The FooSearch instance is called within the executeQueryWithSession() method of the ISearchIndexCollection implementation:

public ISearchSession executeQueryWithSession(...) throws WcmException {
:
   FooSearch searchEngineObject = new FooSearch();
   FooResult[] resultElements;
   try {
       resultElements = searchEngineObject.doSearch(queryString);
   } catch (Exception e1) {
       // Report the failure to the caller instead of only printing it.
       throw new WcmException(e1);
   }
:
}

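The transformation of the 3rd-party search results into raw search results has to be carried out in the executeQueryWithSession() method. The raw results are then converted into KM-compliant result objects in the getSearchResults() method, as described above.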

Configuration and Deployment

The integration of the implementation into KM’s configuration requires the index implementation to be registered as a configurable object in the configuration framework. For this reason, a new class definition and an instance file have to be created at development time.

Preparing the Configurables on IDE level

The class definition (CustomerIndexService.cc.xml) for a 3rd-party search engine should look like this example:

<ConfigClass name="CustomerIndexService" extends="IndexService">
  <attribute name="class" type="class" constant="com.customer.search.CustomerSearchIndex"/>
  <attribute name="indexconfigclass" type="string"/>   
</ConfigClass>  

Additional properties can be specified and evaluated by the index implementation.

The virtual index also needs its own class definition, even if it is an empty implementation. This definition is only used by the index management service to identify the corresponding 3rd-party search engine.

<ConfigClass name="CustomerIndex" extends="Index">  
</ConfigClass>  

The instance (CustomerIndexService.co.xml) of the defined classes above can also be specified during development time:

<Configurable configclass="CustomerIndexService">
   <property name="name" value="3rd party search engine name" />
   <property name="class" value="com.customer.search.CustomerSearchIndex" />
   <property name="displayname" value="3rd party search engine name" />
   <property name="indexconfigclass" value="CustomerIndex" />   
</Configurable>  

The class definition and the instance file have to be bundled with the deployable unit of the search implementation.

Configuring the Search Implementation in CM

A repository is needed for storing the IResource objects temporarily. We recommend that you create a Web repository manager with a dedicated memory cache.

After the deployment of the 3rd-party search engine implementation, register the instance (CustomerIndexService.co.xml) of the search at the global service “Index Service”. You have to restart the J2EE server after the registration process.

Creating a Virtual Index Configuration

The virtual index is created with the standard index administration UI. Create an index instance similar to other search engines such as TREX. Assign the newly created web repository manager as the data source of the virtual index.

Restrictions

Currently there are certain restrictions to this kind of 3rd-party search engine integration.

Ranking Problems/Normalization

The ranking algorithms of different search engines produce different results. This means that the same rank value from search engine A and from search engine B can express different rankings. However, if you merge search results from different search engines, a normalized rank value is needed. There is currently no way to normalize rank values across different search engines, because the algorithms used in the search engine implementations are not known.
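
To illustrate the problem: one engine might report rank values between 0 and 1, another between 0 and 1000. A simple heuristic, which is not part of KM and is not a real normalization, is to rescale the rank values of each engine's result list to the range 0 to 1 before merging. The following sketch shows this min-max scaling; the class RankScalingSketch is hypothetical. It only aligns the value ranges. It cannot make a rank of 0.8 from engine A mean the same as 0.8 from engine B, because the underlying algorithms remain unknown.

```java
final class RankScalingSketch {
    /** Rescales the rank values of ONE engine's result list to [0, 1]. */
    static float[] minMaxScale(float[] ranks) {
        float[] scaled = new float[ranks.length];
        if (ranks.length == 0) {
            return scaled;
        }
        float min = ranks[0];
        float max = ranks[0];
        for (float rank : ranks) {
            min = Math.min(min, rank);
            max = Math.max(max, rank);
        }
        float range = max - min;
        for (int i = 0; i < ranks.length; i++) {
            // If all ranks are equal, treat every hit as a top hit.
            scaled[i] = (range == 0f) ? 1.0f : (ranks[i] - min) / range;
        }
        return scaled;
    }
}
```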

Presentation of Results

Search results provided by the index management service are presented by the KM flexible user interface. This means that results are always presented using HTMLB technology. The flexible UI can be extended by implementing your own rendering controls such as collection renderers or resource renderers.

Sample 3rd party integration Eclipse project

The following archive contains a deployable Eclipse project based on the KM API of EP 6.0 SP2 that shows the steps described above in a complete example.

Download com.sap.netweaver.kmc.searchengine.zip